Motivations

Unemployment rate.

In this example, there is a correlation between the level of education and the unemployment rate. We observe a correlation between two variables: More education = less unemployment, but this does not necessarily mean that there is a cause-and-effect relationship (causality) between these variables. Unmeasured factors such as the economy, politics, or social conditions can influence both education and unemployment without there being a direct link between them.

  • Data: OECD study (Organisation for Economic Co-operation and Development)
  • Input: PISA score
  • Output: unemployment rate

Chocolate consumption per capita.

In the following case, we observe a linear correlation between two variables, such as the number of Nobel Prizes in a country and chocolate consumption. It is unlikely that this implies a direct causal relationship between these variables. This situation illustrates the principle that correlation does not prove causation.

  • Data: 22 countries
  • Input: Chocolate consumption per capita
  • Output: Number of Nobel Prizes

The Ozone Example (with R)

This dataset (Air Breizh, Summer 2001) contains 112 observations and 13 variables.

library(rmarkdown)
ozone1 <- read.table("ozone1.txt",header=TRUE, sep="", dec=",")
paged_table(ozone1)

Let’s correct the incorrectly declared variable types:

💡 Good to Know!

  • With R, the Date column is read as an integer; as.Date needs a character (or factor) input together with a format string, hence the conversion with as.factor before calling as.Date with format = "%Y%m%d".

  • Here, across is used with where(is.character) to target only the columns that are of character type (chr) and apply as.numeric to these columns.


library(dplyr)
ozone1$Date <- as.Date(as.factor(ozone1$Date),format = "%Y%m%d")
ozone1$vent<-as.factor(ozone1$vent)
ozone1$pluie<-as.factor(ozone1$pluie)
ozone1 <- ozone1 %>% mutate(across(where(is.character), as.numeric))
ozone1 <- ozone1 %>% mutate(across(where(is.integer), as.numeric))
str(ozone1)
## 'data.frame':    112 obs. of  14 variables:
##  $ Date  : Date, format: "2001-06-01" "2001-06-02" "2001-06-03" ...
##  $ maxO3 : num  87 82 92 114 94 80 79 79 101 106 ...
##  $ T9    : num  15.6 17 15.3 16.2 17.4 17.7 16.8 14.9 16.1 18.3 ...
##  $ T12   : num  18.5 18.4 17.6 19.7 20.5 19.8 15.6 17.5 19.6 21.9 ...
##  $ T15   : num  18.4 17.7 19.5 22.5 20.4 18.3 14.9 18.9 21.4 22.9 ...
##  $ Ne9   : num  4 5 2 1 8 6 7 5 2 5 ...
##  $ Ne12  : num  4 5 5 1 8 6 8 5 4 6 ...
##  $ Ne15  : num  8 7 4 0 7 7 8 4 4 8 ...
##  $ Vx9   : num  0.695 -4.33 2.954 0.985 -0.5 ...
##  $ Vx12  : num  -1.71 -4 1.879 0.347 -2.954 ...
##  $ Vx15  : num  -0.695 -3 0.521 -0.174 -4.33 ...
##  $ maxO3v: num  84 87 82 92 114 94 80 99 79 101 ...
##  $ vent  : Factor w/ 4 levels "Est","Nord","Ouest",..: 2 2 1 2 3 3 3 2 2 3 ...
##  $ pluie : Factor w/ 2 levels "Pluie","Sec": 2 2 2 2 2 1 2 2 2 2 ...

Objective. Predict the maximum ozone concentration in the air (maxO3). Initially, let’s focus solely on the temperatures recorded at 12 PM (T12). Find a function \(f\) such that

\(\text{maxO3} \approx f(\text{T12})\)

with \(X\) = T12: temperatures recorded at 12 PM and \(Y\) = maxO3: maximum ozone concentration in the air.

The function \(f\) is called the regression function!

Question: To which family of functions does \(f\) belong?

  • Affine? (Which one?)
  • Polynomial? (Degree?)
  • Spline? (Degree?)
  • Kernel? (Which one?)

Find the function that minimizes the sum of the squared distances from the function to the points \((\text{T12}_i, \text{maxO3}_i)\).

\[\widehat{f}\in\arg\min_{f} \sum_{i=1}^n(\text{maxO3}_i-f(\text{T12}_i))^2\]

1. Linear regression model

1.1. Simple linear model

Let’s consider the simple linear model, such that

maxO3 \(\approx\) affine function of T12 \(\Leftrightarrow\) \(y \approx f(x) \approx \beta_0 + \beta_1 x\), where \(x\) = T12 and \(y\) = maxO3.

For \(i=1,\cdots, n\), with \(n\) the number of observations \((\text{maxO3}_i,\text{T12}_i)\), we define the simple linear model as follows \[ y_i=\beta_0+\beta_1x_i+\epsilon_i, \] where

  • \(x_i\) = T12 is called the explanatory variable, regressor, predictor, covariate, etc.

  • \(y_i\) = maxO3 is called the response variable, variable to explain, the target, the label, etc.

  • \(\epsilon_i\) is called the error, or theoretical residual

  • \(\beta_0\) is called the intercept

  • \(\beta_1\) is called the slope or weight of the variable \(x\)


Matrix form. Let us define the following quantities

  • The response vector \(\mathbf{y}= \left(y_1\ \cdots \ y_n\right)^\top\in\mathbb{R}^n\)
  • The error vector \(\boldsymbol{\epsilon} = \left(\epsilon_1\ \cdots \ \epsilon_n\right)^\top\in\mathbb{R}^n\)
  • The unknown regression coefficients vector \(\boldsymbol{\beta}=(\beta_0,\beta_1)^\top\)
  • The covariate \(\mathbf{X}=(\text{T12}_1,\cdots,\text{T12}_n)^\top\in\mathbb{R}^n\) and denote by \(\mathbf{x}_i=(1,\text{T12}_i)^\top\in\mathbb{R}^2\) and
  • We denote by \(\boldsymbol{1}_n=(1,\cdots,1)^\top\) the one vector in \(\mathbb{R}^n\).
  • We define the Design matrix s.t \[\mathbb{X} = \begin{pmatrix} 1& \text{T12}_1 \\ \vdots&\vdots\\ 1 & \text{T12}_n\\ \end{pmatrix}= (\boldsymbol{1}_n,\mathbf{X})=\left( \begin{array}{c} \mathbf{x}^\top_1 \\ \vdots\\ \mathbf{x}^\top_n \end{array} \right)\in\mathbb{R}^{n\times2} \] Then, \[\mathbf{y}=\beta_0\boldsymbol{1}_n+\mathbf{X}\beta_1+\boldsymbol{\epsilon}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\]
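As an illustration, here is a minimal base-R sketch of the design matrix \(\mathbb{X}=(\boldsymbol{1}_n,\mathbf{X})\) for the simple model; the simulated x and y below are stand-ins for T12 and maxO3, not the ozone data.

```r
# Simulated stand-ins for T12 and maxO3 (NOT the ozone data)
set.seed(1)
n <- 10
x <- runif(n, 10, 30)                  # plays the role of T12
y <- -27 + 5.5 * x + rnorm(n, 0, 10)   # plays the role of maxO3
X <- cbind(1, x)                       # design matrix (1_n, X), of size n x 2
dim(X)
head(X)
```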


Problem \(\boldsymbol{\beta}=(\beta_0, \beta_1)^\top\) is unknown!

Solution Estimate them from our sample of \(n\) observations.


Modeling involves assuming that the \(y_i\) are realizations of random variables \(Y_i\) whose distribution we specify.

The simple linear gaussian model. Let \((Y_i,x_i)_{i=1,\cdots,n}\), we define the simple linear gaussian model as follows \[ Y_i=\beta_0+\beta_1x_i+\epsilon_i,\quad i=1,\cdots,n \] where the following Assumptions are satisfied:

  • [P1]: The \(\epsilon_i\) are centered.
  • [P2]: The \(\epsilon_i\) have a constant variance (noise level) \(\sigma^2>0\) (homoscedasticity).
  • [P3]: The \(\epsilon_i\) are uncorrelated.
  • [P4]: The \(\epsilon_i\) are Gaussian.

These assumptions can be summarized as \(\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\).

The simple linear gaussian model (matrix form).

\[\mathbf{Y}=\beta_0\boldsymbol{1}_n+\mathbf{X}\beta_1+\boldsymbol{\epsilon}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\] where the following Assumptions are satisfied:

  • [P1]: \(\text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{0}_n\) .
  • [P2-P3]: \(\text{Var}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\sigma^2\mathbf{I}_n\)
  • [P4]: \(\boldsymbol{\epsilon}\) is Gaussian.

These assumptions can be summarized as \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2\mathbf{I}_n)\).

Prediction with the estimated model. \[\widehat{Y}_i = \widehat{f}\big(x_i\big) = \widehat{\beta}_0 + \widehat{\beta}_1 x_i\]


COMMENTS

📝Assumption [P1]. In practice, this means that the model is correct and that we have not forgotten a relevant term: the model is indeed linear.

📝Assumption [P2]. We speak of a Homoscedastic model, as opposed to a heteroscedastic model where the error term would not have the same variance for all observations.

📝Assumption [P3]. Thus, the observations are assumed to be uncorrelated, i.e., independent sampling or the results of a physical experiment conducted under independent conditions. Problems can arise when time plays a role in the phenomenon.

📝Assumption [P4]. This postulate is the least important, as we will be able to do without it if the number of observations is large (more than 20 or 30 observations). It is difficult to detect the non-Gaussianity of errors. We will see that R proposes graphical tools to try to validate this postulate or not.


Ordinary least squares (OLS) estimator

To find \((\widehat{\beta}_0, \widehat{\beta}_1)^\top\), we minimize the sum of squared errors from the model \[Y_i = f(x_i) + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \cdots, n\] That is, \((\widehat{\beta}_0, \widehat{\beta}_1)\) are the arguments that minimize the sum of squared residuals

\[ \begin{array}{ll} \widehat {\boldsymbol{\beta}}=(\widehat{\beta}_0, \widehat{\beta}_1)^\top &\in \underset{(\beta_0, \beta_1)}{\arg\min} \sum_{i=1}^n (Y_i - f(x_i))^2\\ &= \underset{(\beta_0, \beta_1)}{\arg\min} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2=\underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\min} \|\mathbf{Y}-\mathbb{X}\boldsymbol{\beta}\|^2\\ &=\underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\min} \sum_{i=1}^n \epsilon_i^2=\underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\min} \|\boldsymbol{\epsilon}\|^2 \end{array} \]

This estimator is called the least squares estimator, which is also the maximum likelihood estimator in this context.
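A quick numerical sanity check (simulated data, base R only): the coefficients returned by lm() achieve a smaller sum of squared residuals than nearby perturbed coefficient values, as the least squares definition requires.

```r
# Check that coef(lm(...)) minimizes the sum of squared residuals (simulated data)
set.seed(42)
n <- 50
x <- runif(n, 10, 30)
y <- -27 + 5.5 * x + rnorm(n, 0, 10)
fit <- lm(y ~ x)
ssr <- function(b) sum((y - b[1] - b[2] * x)^2)  # sum of squared residuals at b
b_hat <- unname(coef(fit))
# Any small perturbation of b_hat increases the criterion
ssr(b_hat) <= min(ssr(b_hat + c(0.1, 0)), ssr(b_hat + c(0, 0.1)),
                  ssr(b_hat - c(0.1, 0)), ssr(b_hat - c(0, 0.1)))  # TRUE
```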


Let’s plot this line (here with R and ggplot2); it is called the least squares line.

library(ggplot2)
library(plotly)
p1<-ggplot(ozone1) + aes(x = T12, y = maxO3) + geom_point(col = 'red', size = 0.5) + geom_smooth(method = "lm", se = FALSE)
ggplotly(p1)

With R, we can compute \(\widehat {\boldsymbol{\beta}}=(\widehat{\beta}_0, \widehat{\beta}_1)\)

library(kableExtra)
mod <- lm(maxO3 ~ T12, data = ozone1)
mod$coefficients %>% kbl(col.names = c("Estimation"))
Estimation
(Intercept) -27.419636
T12 5.468685


1.2. Multivariate Linear Regression

The Ozone dataset actually contains much more information. In addition to the response variable maxO3, there are nine covariates: \(X_1 :=\)T9, \(X_2 :=\)T12, \(X_3:=\)T15, \(X_4 :=\)Ne9, \(X_5:=\)Ne12, \(X_6 :=\)Ne15, \(X_7 :=\)Vx9, \(X_8:=\)Vx12, and \(X_9:=\)Vx15.

Define by \(\mathbf{y} = \left(y_1\ \cdots \ y_n\right)^\top\) and for \(p=9\) the design matrix is such that \[ \mathbb{X}=(\boldsymbol{1}_n,\mathbf{X}_1,\cdots,\mathbf{X}_p)=\begin{pmatrix} 1 & x_{11} & \cdots & x_{1p}\\ \vdots & \vdots&\vdots&\vdots\\ 1 & x_{i1} & \cdots & x_{ip}\\ \vdots & \vdots&\vdots&\vdots\\ 1 & x_{n1} & \cdots & x_{np} \\ \end{pmatrix}=\begin{pmatrix} \mathbf{x}^\top_1 \\ \vdots\\ \mathbf{x}^\top_n \end{pmatrix}\in\mathbb{R}^{n\times (p+1)} \]

The multivariate linear gaussian regression model. Let \((Y_i,x_i)_{i=1,\cdots,n}\), we define the multivariate linear gaussian model as follows \[ Y_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}+\epsilon_i,\quad i=1,\cdots,n \]

The multivariate linear gaussian regression model (matrix form).

\[\mathbf{Y}=\beta_0\boldsymbol{1}_n+\mathbf{X}_1\beta_1+\cdots+\mathbf{X}_p\beta_p+\boldsymbol{\epsilon}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\] where the following Assumptions are satisfied:

  • [P1]: The \(\epsilon_i\) are centered.
  • [P2]: The \(\epsilon_i\) have a constant variance (noise level) \(\sigma^2>0\) (homoscedasticity).
  • [P3]: The \(\epsilon_i\) are uncorrelated.
  • [P4]: The \(\epsilon_i\) are Gaussian.

These assumptions can be summarized as \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2\mathbf{I}_n)\).

Prediction with the estimated model. \[\widehat{Y}_i = \widehat{f}\big(x_i\big) = \widehat{\beta}_0 + \widehat{\beta}_1 x_{i1}+\cdots+\widehat{\beta}_p x_{ip}\quad \text{and}\quad\widehat{\mathbf{Y}}=\mathbb{X}\widehat {\boldsymbol{\beta}}\]


COMMENTS

📝 We could use only the original covariates or create new ones by transforming the existing ones (taking the square, the logarithm, etc.). This is called feature engineering.

📝 Thus, a linear model does not mean that the relationship between the predictors and the response variable is linear, but that the model is linear in the \(\beta_j\) parameters.

2. Ordinary least squares estimator (OLSE)

2.1. Definition and rank assumption

Ordinary least squares (OLS) estimator

The OLS estimator is such that \[ \widehat {\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\arg\min} \|\mathbf{Y}-\mathbb{X}\boldsymbol{\beta}\|^2= \underset{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\arg\min}\sum_{i=1}^n \big(Y_i -(\beta_0+ \sum_{j=1}^px_{ij}\beta_j)\big)^2 \]

📝 Note that! For the course, we assume that \((p+1)\leq n\).

Rank assumption. The matrix \(\mathbb{X}\) is full rank, \(\text{rank}(\mathbb{X}) = p+1\).


COMMENTS

📝 The model is identifiable if for any \(\boldsymbol{\beta}\) and \(\boldsymbol{\beta}'\) in \(\mathbb{R}^{p+1}\) such that \(\mathbb{X}\boldsymbol{\beta}=\mathbb{X}\boldsymbol{\beta}'\), we have \(\boldsymbol{\beta}=\boldsymbol{\beta}'\). This guarantees the uniqueness of the regression vector. This is a very important hypothesis, and it holds as soon as the linear map associated with \(\mathbb{X}\) is injective, which is equivalent to \(\text{rank}(\mathbb{X}) = p+1\).

📝 This hypothesis is crucial if we want to study the importance of the covariates, but it is not necessary for the prediction problem: predicting future values of \(Y\) based on future observations of the predictors.

2.2. Closed form

[Proposition 1]:

Let \(\mathbb{X}\) be a design matrix of size \(n \times (p+1)\) with full rank (\(\text{rank}(\mathbb{X}) = p+1\)), and consider the linear model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\) with \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\). Then the least squares estimator \[ \widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \|\mathbf{Y} - \mathbb{X} \boldsymbol{\beta}\|^2 \] is unique and can be written as \[ \widehat{\boldsymbol{\beta}} = \widehat{\boldsymbol{\beta}}(\mathbf{Y}) = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} \]

Proof. Will be done in the tutorial.\(\square\)
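Proposition 1 can be checked numerically with base R (simulated data, not the ozone set): computing \((\mathbb{X}^\top\mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}\) by hand recovers the coefficients returned by lm().

```r
# Closed-form OLS vs lm() on simulated data
set.seed(3)
n <- 60
x <- runif(n, 10, 30)
y <- -27 + 5.5 * x + rnorm(n, 0, 10)
X <- cbind(1, x)                              # design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'y
max(abs(beta_hat - coef(lm(y ~ x))))          # numerically zero
```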

COMMENTS

📝 Let \(\mathbb{X}\) be the \(n\times (p+1)\) design matrix and denote by \([X]:=\mathcal{I}m(\mathbb{X})\) the space generated by the \(p+1\) columns of \(\mathbb{X}\).

📝 Denote by \(P_X\) the orthogonal projection onto \([X]\). Then \(P_{X^{\perp}}=\mathbf{I}_n-P_X\) is the orthogonal projection matrix onto \([X]^{\perp}\), the orthogonal complement of \([X]\). Since \[\widehat{\mathbf{Y}}=\mathbb{X}\widehat{\boldsymbol{\beta}}=\mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}=P_X\,\mathbf{Y},\] the fitted value \(\widehat{\mathbf{Y}}=\mathbb{X}\widehat{\boldsymbol{\beta}}\) is the orthogonal projection of \(\mathbf{Y}\) onto \([X]\).

📝 Note that in the simplest model, reduced to the intercept, \(\mathbf{Y} =\beta_0\boldsymbol{1}_n + \boldsymbol{\epsilon}\), the least squares estimator of the real parameter \(\beta_0\) is \(\widehat{\beta}_0=\overline{Y}\), the empirical mean of the observations.
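The intercept-only remark is easy to verify in R (simulated data): lm(y ~ 1) returns the sample mean.

```r
# Intercept-only model: the OLS estimate is the sample mean
set.seed(4)
y <- rnorm(25, mean = 100, sd = 15)
fit0 <- lm(y ~ 1)
all.equal(unname(coef(fit0)), mean(y))   # TRUE
```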

2.3. Declaration of the model with R

Let’s revisit the ozone example and, for this chapter, select only the numerical regressors with R:

names(ozone1)
##  [1] "Date"   "maxO3"  "T9"     "T12"    "T15"    "Ne9"    "Ne12"   "Ne15"   "Vx9"    "Vx12"  
## [11] "Vx15"   "maxO3v" "vent"   "pluie"
Ozone<-ozone1[,2:12]
names(Ozone)
##  [1] "maxO3"  "T9"     "T12"    "T15"    "Ne9"    "Ne12"   "Ne15"   "Vx9"    "Vx12"   "Vx15"  
## [11] "maxO3v"

With R, we can declare the linear model in two ways: either by naming the variables or by using a dot (.).

mod<-lm(maxO3~T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v,data=Ozone)
mod<-lm(maxO3~.,data=Ozone)

Now, let’s display the estimated values \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\).

coefficients(mod)
## (Intercept)          T9         T12         T15         Ne9        Ne12        Ne15         Vx9 
## 12.24441987 -0.01901425  2.22115189  0.55853087 -2.18909215 -0.42101517  0.18373097  0.94790917 
##        Vx12        Vx15      maxO3v 
##  0.03119824  0.41859252  0.35197646

2.4. Statistical Properties of the OLSE under the Assumptions

[Proposition 2]:

Let us consider the linear regression model \(\mathbf{Y} = \mathbb{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\). Under the Rank assumption and [P1], the LSE \(\widehat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\): \[\forall\,\boldsymbol{\beta}\in\mathbb R^{p+1},\quad \text{E}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}]=\boldsymbol{\beta}.\]


Proof. As \(\mathbb{X}\) is a deterministic matrix and \(\mathbf{Y} = \mathbb{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\), it comes \[\begin{eqnarray*} \text{E}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}]&=&\text{E}_{\boldsymbol{\beta}}[(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}] =\text{E}_{\boldsymbol{\beta}}[(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top (\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon})]\\ &=&(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbb{X}\boldsymbol{\beta}+(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{\beta}, \end{eqnarray*}\] since \((\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbb{X}=\mathbf{I}_{p+1}\) and \(\text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{0}_n\). \(\square\)

[Proposition 3]: Let \(\mathbf{Y}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\), with \(\boldsymbol{\beta}\in\mathbb{R}^{p+1}\) and \(\mathbb{X}\) the \(n\times (p+1)\) full rank design matrix, then \(\widehat {\boldsymbol{\beta}}=\widehat {\boldsymbol{\beta}}(\mathbf{Y})=(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbf{Y}\) is such that, under [P1] (\(\text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{0}_n\)) and [P2]-[P3] (\(\text{Var}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\sigma^2\mathbf{I}_n\)), \[ \text{Var}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}]=\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1} \]


Proof. \[\begin{eqnarray*} \text{Var}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}] &=&\text{Var}_{\boldsymbol{\beta}}[(\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y}]\\ & = & (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \text{Var}_{\boldsymbol{\beta}}(\mathbf{Y}) \left((\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \right)^\top \\ &=& \sigma^2 (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{X} \left((\mathbb{X}^\top \mathbb{X})^{-1}\right)^\top\quad \mbox{ as } \text{Var}_{\boldsymbol{\beta}}(\mathbf{Y}) = \sigma^2 \mathbf{I}_n \\ &=& \sigma^2 (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1}\quad \mbox{ as } (\mathbb{X}^\top \mathbb{X})^{-1} \mbox{ is symmetric } \\ &=& \sigma^2 (\mathbb{X}^\top \mathbb{X})^{-1}. \qquad\square \end{eqnarray*}\]
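Propositions 2 and 3 can be illustrated by a small Monte Carlo experiment in base R (a sketch on a simulated fixed design, with assumed values n = 40, σ = 2, β = (1, 3)): the empirical mean and covariance of \(\widehat{\boldsymbol{\beta}}\) over repeated samples approach \(\boldsymbol{\beta}\) and \(\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1}\).

```r
# Monte Carlo illustration of E[beta_hat] = beta and Var[beta_hat] = sigma^2 (X'X)^{-1}
set.seed(5)
n <- 40; sigma <- 2; beta <- c(1, 3)
x <- runif(n)
X <- cbind(1, x)                     # fixed (deterministic) design
XtXinv <- solve(t(X) %*% X)
B <- replicate(5000, {
  y <- as.vector(X %*% beta) + rnorm(n, 0, sigma)
  as.vector(solve(t(X) %*% X, t(X) %*% y))
})
rowMeans(B)                          # close to beta = (1, 3)
cov(t(B))                            # close to the theoretical covariance below
sigma^2 * XtXinv
```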

[Theorem 1: (Gauss-Markov theorem)]:

Let \(\mathbb{X}\) be a design matrix of size \(n \times (p+1)\) with full rank (\(\text{rank}(\mathbb{X}) = p+1\)). Let \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\), be a linear model. Assuming [P1]-[P2]-[P3] are satisfied, the ordinary least squares estimator \[ \widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} \] is optimal in the class \(\mathcal{E}\) of unbiased linear estimators (i.e., linear combinations of the observations), in the sense that \[ \lambda^\top \left(\text{Var}_{\boldsymbol{\beta}}[\widetilde{\boldsymbol{\beta}}] - \text{Var}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}] \right) \lambda \geq 0, \quad \forall \widetilde{\boldsymbol{\beta}} \in \mathcal{E}, \quad \forall \lambda \in \mathbb{R}^{p+1}. \]


Proof. Let \(C\) be a matrix of size \(r \times n\) (\(r = p+1\)). Let \(\widetilde{\boldsymbol{\beta}} = C \mathbf{Y} = C \mathbb{X} \boldsymbol{\beta} + C \boldsymbol{\epsilon}\) be an unbiased linear estimator. We have, for all \(\boldsymbol{\beta}\): \[\begin{eqnarray*} &&\text{E}_{\boldsymbol{\beta}}[C \mathbf{Y}] - \boldsymbol{\beta} = \boldsymbol{0}_r \Leftrightarrow C \mathbb{X} \boldsymbol{\beta} - \boldsymbol{\beta} = \boldsymbol{0}_r \\ &&\Leftrightarrow (C \mathbb{X} - \mathbf{I}_r) \boldsymbol{\beta} = \boldsymbol{0}_r \Leftrightarrow C \mathbb{X} = \mathbf{I}_r \, \text{ and } \, \mathbb{X}^\top C^\top = \mathbf{I}_r \end{eqnarray*}\] Thus, for \(P_X\) and \(P_{X^\perp}\), the orthogonal projectors onto \([X]\) and \([X]^\perp\), we have: \[\begin{eqnarray*} &&\text{Var}_{\boldsymbol{\beta}}(\widetilde{\boldsymbol{\beta}}) - \text{Var}_{\boldsymbol{\beta}}(\widehat{\boldsymbol{\beta}}) \\ &=& \sigma^2 (CC^\top - (\mathbb{X}^\top \mathbb{X})^{-1}) = \sigma^2 C (\mathbf{I}_n - \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top) C^\top \\ &=& \sigma^2 C (\mathbf{I}_n - P_X) C^\top = \sigma^2 C P_{X^\perp} C^\top = \sigma^2 C P_{X^\perp} P_{X^\perp}^\top C^\top \end{eqnarray*}\] since \(P_{X^\perp} = P_{X^\perp}^2 = P_{X^\perp} P_{X^\perp}^\top\) (properties of the projector). Therefore, for all \(\lambda \in \mathbb{R}^{p+1}\), we have: \[ \lambda^\top (\text{Var}_{\boldsymbol{\beta}}(\widetilde{\boldsymbol{\beta}}) - \text{Var}_{\boldsymbol{\beta}}(\widehat{\boldsymbol{\beta}})) \lambda = \sigma^2 \lambda^\top C P_{X^\perp} P_{X^\perp}^\top C^\top \lambda = \sigma^2 \|P_{X^\perp}^\top C^\top \lambda\|^2 \geq 0. \qquad \square \]

📝 In the context of the Gauss-Markov theorem, the Ordinary Least Squares Estimator (OLSE) is considered the Best Linear Unbiased Estimator (BLUE).

[Proposition 4]:

Let \(\mathbb{X}\) be a design matrix of size \(n \times (p+1)\) with full rank (\(\text{rank}(\mathbb{X}) = p+1\)). Let \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\), be a linear model. Assuming [P1]-[P4] are satisfied, then

\[\widehat{\boldsymbol{\beta}}\sim \mathcal{N}(\boldsymbol{\beta},\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1})\]


Proof. As \(\mathbf{Y}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\) is a linear transformation of the Gaussian vector \(\boldsymbol{\epsilon}\), so is \(\widehat{\boldsymbol{\beta}}=(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}\); we conclude by Propositions 2 and 3. \(\square\)

3. Residuals analysis

Residuals (or estimated residuals). We define the residuals as follows: \[\widehat{\boldsymbol{\epsilon}}=\mathbf{Y}-\widehat{\mathbf{Y}}=(\mathbf{I}_n-P_X)\mathbf{Y}=P_{X^\perp}\mathbf{Y}\] where \(\widehat{\mathbf{Y}}\) is the vector of fitted values \[\widehat{\mathbf{Y}}=\mathbb{X}\widehat{\boldsymbol{\beta}}=\mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbf{Y}=P_X\,\mathbf{Y}\]


summary(mod$residuals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -53.566  -8.727  -0.403   0.000   7.599  39.458
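Since \(\widehat{\boldsymbol{\epsilon}}=P_{X^\perp}\mathbf{Y}\), the residuals are orthogonal to the columns of \(\mathbb{X}\); in particular their sum is zero when the model contains an intercept, which is why the Mean above is 0.000. A base-R check on simulated data:

```r
# Residuals are orthogonal to the columns of the design matrix (simulated data)
set.seed(6)
n <- 30
x <- runif(n, 10, 30)
y <- -27 + 5.5 * x + rnorm(n, 0, 10)
fit <- lm(y ~ x)
e <- residuals(fit)
c(sum(e), sum(e * x))                          # both numerically zero
all.equal(unname(fitted(fit)) + unname(e), y)  # Y = fitted + residuals: TRUE
```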

3.1. Properties of \(\widehat{\mathbf{Y}}\) and \(\widehat{\boldsymbol{\epsilon}}\)

[Theorem 2]: We consider the linear regression model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), under the rank assumption \(\text{rank}(\mathbb{X}) = p+1 = r\) and [P1]-[P4], i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 \mathbf{I}_n)\).

Let \(\widehat{\mathbf{Y}} = \mathbb{X}\widehat{\boldsymbol{\beta}}(\mathbf{Y}) = P_X\,\mathbf{Y}\) and \(\widehat{\boldsymbol{\epsilon}} = \mathbf{Y} - \widehat{\mathbf{Y}} = P_{X^\perp}\mathbf{Y}\). Then:

  1. \(\widehat{\mathbf{Y}} \sim \mathcal{N}(\mathbb{X}\boldsymbol{\beta}, \sigma^2 P_X) \quad \text{and} \quad \widehat{\boldsymbol{\epsilon}} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 P_{X^\perp})\).

  2. \(\widehat{\mathbf{Y}}\) is independent of \(\widehat{\boldsymbol{\epsilon}}\),

  3. \(\frac{\|\widehat{\mathbf{Y}} - \mathbb{X}\boldsymbol{\beta} \|^2}{\sigma^2}\) is independent of \(\frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{\sigma^2}\), \(\quad \frac{\|\widehat{\mathbf{Y}} - \mathbb{X}\boldsymbol{\beta} \|^2}{\sigma^2} \sim \chi^2_r \quad \text{and} \quad \frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{\sigma^2} \sim \chi^2_{n-r}\).

  4. \(\widehat{\boldsymbol{\epsilon}}\) is independent of \(\widehat{\boldsymbol{\beta}}\).


Proof.

Points 1,2 and 3. Note that \(\mathbf{Y} \sim \mathcal{N}(\mathbb{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)\). Moreover \(\quad P_X \mathbf{Y} = \widehat{\mathbf{Y}}\), and \(P_{X^\perp} \mathbf{Y} = \widehat{\boldsymbol{\epsilon}}\), therefore we conclude by Cochran’s theorem with

  • \(F = [X]\), \(Z = \mathbf{Y}\), and \(\boldsymbol{\mu} = \mathbb{X}\boldsymbol{\beta}\).
  • \(P_X \boldsymbol{\mu} = P_X \mathbb{X}\boldsymbol{\beta} = \mathbb{X}\boldsymbol{\beta}\) and \(P_{X^\perp} \boldsymbol{\mu} = P_{X^\perp} \mathbb{X}\boldsymbol{\beta} = \boldsymbol{0}_n\).

  4. As \[ (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top P_X \mathbf{Y} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} = \widehat{\boldsymbol{\beta}}, \]

\(\widehat{\boldsymbol{\beta}}\) is a function of \(P_X \mathbf{Y}\); since \(P_X \mathbf{Y}\) is independent of \(P_{X^\perp} \mathbf{Y}=\widehat{\boldsymbol{\epsilon}}\), we conclude. \(\square\)

3.2. Estimation of \(\sigma^2\) and properties of \(\widehat \sigma^2\)

[Proposition 5]:

We consider the linear regression model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), under the rank assumption \(\text{rank}(\mathbb{X}) = p+1 = r\) and [P1]-[P4], i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 \mathbf{I}_n)\). An unbiased estimator of \(\sigma^2\) is \(\widehat{\sigma}^2\):

\[ \widehat{\sigma}^2 = \frac{\|\mathbf{Y} - \mathbb{X} \widehat{\boldsymbol{\beta}} \|^2}{n-r} = \frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{n-r}. \]


Proof.

\[ \widehat{\sigma}^2 = \frac{\|\mathbf{Y} - \mathbb{X} \widehat{\boldsymbol{\beta}} \|^2}{n-r} = \frac{\|\mathbf{Y} - P_X \mathbf{Y} \|^2}{n-r} = \frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{n-r} = \frac{\|P_{X^\perp} \mathbf{Y} \|^2}{n-r}. \]

Since \(\sigma^{-1} \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \mathbf{I}_n)\), by Cochran’s theorem, we have

\[ \|P_{X^\perp} (\sigma^{-1} \boldsymbol{\epsilon}) \|^2 \sim \chi^2_{n-r}. \]

Thus, since \(\text{E}_{\boldsymbol{\beta}}[\chi^2_{n-r}] = n-r\) and

\[ \text{E}_{\boldsymbol{\beta}}[\|\widehat{\boldsymbol{\epsilon}} \|^2] = \text{E}_{\boldsymbol{\beta}}[\|P_{X^\perp} \boldsymbol{\epsilon} \|^2] = \sigma^2 \text{E}_{\boldsymbol{\beta}}[\|P_{X^\perp} (\sigma^{-1} \boldsymbol{\epsilon}) \|^2] = \sigma^2 (n-r), \] we get \(\text{E}_{\boldsymbol{\beta}}[\widehat{\sigma}^2]=\sigma^2\). \(\square\)

[Theorem 3]:

We consider the linear regression model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), under the rank assumption \(\text{rank}(\mathbb{X}) = p+1 = r\) and [P1]-[P4], i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 \mathbf{I}_n)\).

  1. \(\frac{(n-r)\widehat\sigma^2}{\sigma^2} \sim \chi^2_{n-r}\)
  2. \(\text{Var}(\widehat \sigma^2) = \frac{2 \sigma^4}{n-r}\)
  3. \(\widehat \sigma^2\) is independent of \(\widehat{\mathbf{Y}}\)
  4. \(\widehat \sigma^2\) is independent of \(\widehat{\boldsymbol{\beta}}\)


Proof.

  1. Note that \(\widehat\sigma^2=\frac{\|P_{ X^\perp}\mathbf{Y}\|^2}{n-r} = \frac{\|P_{ X^\perp}\boldsymbol{\epsilon}\|^2}{n-r}\).

Next, \(\sigma^{-1}\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0}_n,\mathbf{I}_n)\) and \(P_{ X^\perp}\) is an orthogonal projector of rank \((n-r)\). Then, by Cochran’s theorem, \[ \|P_{ X^\perp}(\sigma^{-1}\boldsymbol{\epsilon})\|^2\sim \chi^2_{(n-r)}, \] hence \[\frac{(n-r)\widehat\sigma^2}{\sigma^2}=\|P_{ X^\perp}(\sigma^{-1}\boldsymbol{\epsilon})\|^2\sim \chi^2_{(n-r)}.\]

  2. As \(\widehat{\sigma}^2 = \frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{n-r} = \frac{\sigma^2}{n-r}\cdot\frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{\sigma^2}\), we get \[ \text{Var}_{\boldsymbol{\beta}}[\widehat{\sigma}^2] = \frac{\sigma^4}{(n-r)^2} \text{Var}_{\boldsymbol{\beta}} \left[ \chi^2_{n-r} \right] = \frac{\sigma^4}{(n-r)^2}\cdot 2(n-r) = \frac{2\sigma^4}{n-r} \]

Points 3. and 4. By Cochran’s Theorem, \(P_{X^\perp} \mathbf{Y}\) is independent of \(P_X \mathbf{Y}\). Therefore, as \(\widehat{\sigma}^2 = \frac{\|P_{X^\perp} \mathbf{Y} \|^2}{n-r}\), and \(\widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top P_X \mathbf{Y}\), we conclude. \(\square\)

We can display it with R:

cat("Residual standard error:", summary(mod)$sigma, "on", summary(mod)$df[2], "degrees of freedom\n")
## Residual standard error: 14.36002 on 101 degrees of freedom

📈 Here, Residual standard error is the value of \(\widehat\sigma\); then \(\widehat{\sigma}^2=14.36002^2\), and 101 degrees of freedom is the number of degrees of freedom of the \(\chi^2_{n-r}\) distribution.

3.3. Maximum Likelihood Estimator (MLE)

[Proposition 6]:

The maximum likelihood estimator of \((\boldsymbol{\beta},\sigma^2)\) is \((\widehat{\boldsymbol{\beta}}_{ML},\widehat\sigma_{ML}^2)\) where \[ \widehat{\boldsymbol{\beta}}_{ML} = (\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbf{Y},\quad \widehat\sigma_{ML}^2=\frac{\|P_{X^\perp}\mathbf{Y}\|^2}{n} \]


COMMENTS

📝 In this Gaussian setting, the maximum likelihood estimator (MLE) \(\widehat{\boldsymbol{\beta}}_{ML}\) is the LSE.

📝 Note that the MLE of \(\sigma^2\) is a biased estimator and equals \[\widehat\sigma_{ML}^2=\frac{\|\mathbf{Y}-\mathbb{X}\widehat{\boldsymbol{\beta}}\|^2}{n}=\frac{\|\widehat{\boldsymbol{\epsilon}}\|^2}{n}.\] To be compared with the unbiased estimator: \[ \widehat\sigma^2=\frac{\|\mathbf{Y}-\mathbb{X}\widehat{\boldsymbol{\beta}}\|^2}{n-r}=\frac{\|\widehat{\boldsymbol{\epsilon}}\|^2}{n-r}. \]
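The two estimators are easy to compare in R (simulated data; p = 1 here, so r = 2): summary()$sigma is the square root of the unbiased version.

```r
# Biased MLE of sigma^2 vs the unbiased estimator (simulated data, p = 1)
set.seed(7)
n <- 50
x <- runif(n, 10, 30)
y <- -27 + 5.5 * x + rnorm(n, 0, 10)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
sigma2_ml       <- rss / n        # MLE: divides by n (biased downward)
sigma2_unbiased <- rss / (n - 2)  # divides by n - r, with r = p + 1 = 2
all.equal(sqrt(sigma2_unbiased), summary(fit)$sigma)  # TRUE
```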

[Theorem 4]:

Let us consider the linear regression model \(\mathbf{Y}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\). Assume \(\mathbb{X}\) is full rank, \(r=p+1\), and that [P1]-[P4] hold. We have

  1. \(\forall c\in\mathbb{R}^r\), we have \[ \frac{c^\top \widehat{\boldsymbol{\beta}}-c^\top{\boldsymbol{\beta}}}{\widehat\sigma\sqrt{c^\top (\mathbb{X}^\top \mathbb{X})^{-1}c}}\sim t_{n-r}. \]

  2. \(\forall C \in \mathbb{R}^{q\times r}\) of full rank \(q\) (\(q\leq r\)), we have \[ \frac{(C\widehat{\boldsymbol{\beta}}-C\boldsymbol{\beta})^\top (C(\mathbb{X}^\top \mathbb{X})^{-1}C^\top)^{-1}(C\widehat{\boldsymbol{\beta}}-C\boldsymbol{\beta})}{q\;\widehat\sigma^2}\sim{\text{F}}_{(q,n-r)}. \]


Proof.

  1. By Proposition 4, \(\widehat{\boldsymbol{\beta}}\sim\mathcal{N}(\boldsymbol{\beta},\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1})\), then \(c^\top\widehat{\boldsymbol{\beta}}\sim\mathcal{N}(c^\top\boldsymbol{\beta},\sigma^2c^\top(\mathbb{X}^\top \mathbb{X})^{-1}c)\) as a linear transformation of the previous Gaussian vector. Therefore \[\left\{ \begin{array}{l} \eta:=\frac{c^\top\widehat{\boldsymbol{\beta}}-c^\top\boldsymbol{\beta}}{\sqrt{\sigma^2c^\top(\mathbb{X}^\top \mathbb{X})^{-1}c}}\sim\mathcal{N}(0,1)\\ \mathcal{X}:=\frac{(n-r)\widehat\sigma^2}{\sigma^2}\sim\chi^2_{n-r}\quad\text{by Theorem 3}\\ \eta\text{ is independent of }\mathcal{X}\quad\text{by Theorem 3} \end{array}\right. \] Therefore \[ \frac{\eta}{\sqrt{\mathcal{X}/(n-r)}}=\frac{c^\top\widehat{\boldsymbol{\beta}}-c^\top\boldsymbol{\beta}}{\sqrt{\widehat\sigma^2\,c^\top(\mathbb{X}^\top \mathbb{X})^{-1}c}}\sim t_{n-r}.\qquad\square \]

  2. Will be done in the tutorial.\(\square\)

📝 This result will be very important for building confidence regions and hypothesis testing.
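For instance, point 1 with \(c=(0,1)^\top\) yields the usual 95% \(t\)-interval for \(\beta_1\). A base-R sketch on simulated data, checked against confint():

```r
# 95% t-interval for the slope, built from Theorem 4 (c = (0,1)^T), vs confint()
set.seed(8)
n <- 40
x <- runif(n, 10, 30)
y <- -27 + 5.5 * x + rnorm(n, 0, 10)
fit <- lm(y ~ x)
X  <- model.matrix(fit)
se <- summary(fit)$sigma * sqrt(solve(t(X) %*% X)[2, 2])  # sigma_hat * sqrt(c'(X'X)^{-1}c)
ci <- unname(coef(fit)[2]) + c(-1, 1) * qt(0.975, df = n - 2) * se
ci
confint(fit)["x", ]   # same interval
```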

ANNEXE

A.1. Some useful distributions

Definition [\(\chi^2\) distribution]

Let \(Z_1,\ldots,Z_n\) be i.i.d. \(\mathcal{N}(0,1)\). Then \(\sum_{i=1}^nZ_i^2\sim \chi^2_n\) follows a \(\chi^2\)-distribution with \(n\) degrees of freedom.


Definition [Student \(t\) distribution]

Let \(Z\sim \mathcal{N}(0,1)\) and \(U \sim \chi^2_m\), with \(Z\) independent of \(U\). Then \[ \frac{Z}{\sqrt{U/m}}\sim t_{m}. \]


Definition [Fisher \(\text{F}\) distribution]

Let \(U\sim \chi^2_p\) and \(V\sim \chi^2_q\), with \(U\) independent of \(V\). Then \[ F = \frac{U/p}{V/q} \sim \text{F}_{(p,q)} \]

follows a Fisher distribution with parameters \((p,q)\).
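These definitions can be checked by simulation in base R; for instance, building \(Z/\sqrt{U/m}\) from independent draws reproduces the quantiles given by R's qt().

```r
# Simulating a t_m variable from its definition and comparing quantiles
set.seed(9)
m <- 7; N <- 1e5
Z <- rnorm(N)
U <- rchisq(N, df = m)
T_sim <- Z / sqrt(U / m)   # should follow t_m
quantile(T_sim, 0.975)     # close to the theoretical quantile below
qt(0.975, df = m)
```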


A.2. Cochran’s Theorem

[Cochran’s Theorem]:

Let \(Z\) be a Gaussian random vector in \(\mathbb{R}^n\) with \(Z \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I}_n)\), where \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\sigma > 0\). Let \(F_{1},\ldots ,F_{m}\) be subspaces of dimensions \(d_1,\ldots,d_m\), orthogonal to each other and such that \(\mathbb{R}^{n} = F_1 \oplus \dots \oplus F_m\). For \(i=1,\ldots,m\), let \(P_{F_i}\) denote the orthogonal projection matrix onto \(F_i\). Then:

  1. The random vectors \(P_{F_{1}}Z,\ldots ,P_{F_{m}}Z\) have respective distributions \(\mathcal{N}(P_{F_{1}}\boldsymbol{\mu}, \sigma^2 P_{F_{1}}),\ldots ,\mathcal{N}(P_{F_{m}}\boldsymbol{\mu}, \sigma^2 P_{F_{m}})\).

  2. The random vectors \(P_{F_{1}}Z,\ldots ,P_{F_{m}}Z\) are pairwise independent.

  3. The random variables \(\frac{\|P_{F_{1}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2},\ldots ,\frac{\|P_{F_{m}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2}\) have respective distributions \(\chi^2(d_{1}),\ldots ,\chi^2(d_{m})\).

  4. The random variables \(\frac{\|P_{F_{1}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2},\ldots ,\frac{\|P_{F_{m}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2}\) are pairwise independent.


Proof. Will be done in the tutorial.\(\square\)